
npj Digital Medicine

Springer Science and Business Media LLC

Preprints posted in the last 90 days, ranked by how well they match npj Digital Medicine's content profile, based on 97 papers previously published here. The average preprint has a 0.23% match score for this journal, so anything above that is already an above-average fit.

1
Wearables Anticipate Postoperative Complications: A Prospective Cohort Study

Lederer, L.; Roghanizad, A. R.; Howell, T. C.; Turnage, K.; Blazer, D. G.; Knackstedt, R.; Hwang, S.; Dunn, J.

2026-06-03 surgery 10.64898/2026.06.02.26354556 medRxiv
Top 0.1%
73.1%

Consumer wearable devices enable continuous passive physiologic monitoring in free-living conditions, yet their capacity to detect early postoperative deterioration following hospital discharge remains poorly characterized. Here we report a prospective observational cohort study evaluating multimodal wearable-derived physiologic signals across the perioperative period in adults undergoing elective oncologic surgery at Duke University Health System. Participants were monitored using an Oura Ring Gen 2 and Garmin Vivosmart 4 from at least two weeks preoperatively through up to 90 days postoperatively, alongside daily electronic patient-reported pain surveys. Devices captured 3,705 participant-days and 82,833 hours of physiologic data across 46 surgical patients. Oura adherence averaged 21.0 hours/day and was significantly higher than Garmin throughout the study period (17.6 hours/day). Garmin wear time declined significantly following surgery, while Oura adherence remained comparatively stable. Postoperative complications occurred in 17 participants (37%), including 10 major complications (Clavien-Dindo grade IIIb or higher) with a median onset of 13 days after surgery. Patients with major complications demonstrated significantly greater peak deviations from baseline in the first 10 postoperative days across resting heart rate, sleep temperature deviation, and readiness metrics. In the days before clinically documented major complications, wearable and patient-reported signals diverged from those of participants without major complications, with reduced activity appearing as early as four days before the event, followed by higher reported pain and later elevations in resting heart rate and sleep temperature deviation. 
These findings support the feasibility of prolonged perioperative wearable monitoring and suggest that physiologic deterioration preceding major surgical complications may be detectable days before clinical documentation, motivating further development and validation of wearable-based postoperative surveillance strategies.

2
Learning Patient-Specific Event Sequence Representations for Clinical Process Analysis

Solyomvari, K.; Antikainen, T.; Moen, H.; Marttinen, P.; Renkonen, R.; Koskinen, M.

2026-03-30 health informatics 10.64898/2026.03.25.26348333 medRxiv
Top 0.1%
61.3%

Healthcare system performance evaluation is constrained by episodic performance indicators and process mining techniques that fail to accommodate the scale, heterogeneity, and temporal complexity of real-world clinical pathways. Electronic health records enable reconstructing patient journeys that capture how care processes unfold across fragmented healthcare services. Here we present ClinicalTAAT, a time-aware transformer that bridges clinical sequence modeling and process mining by integrating contextual and time-varying information to learn interpretable patient-specific representations from inherently sparse, irregular and high-dimensional clinical event sequences. Evaluated on a large pediatric emergency cohort, ClinicalTAAT outperforms existing models in acuity and diagnosis classification, identifies clinically meaningful patient subgroups in a heterogeneous population with distinct acuity, resource utilization and diagnostic patterns, and detects anomalies in individual care trajectories. These findings demonstrate that time-aware transformers can complement existing process mining methodologies and serve as foundation models for clinical process analysis, providing a scalable framework for data-driven healthcare evaluation and optimization.

3
Claim-Level Transparency Analysis of LLM-Generated Diagnostic Reports: A Metabolic and Endocrine Biomarker Study

Yasinetsky, A.; Ikonomovska, E.; Geniesse, C.; Yasinetsky, A.

2026-05-06 systems biology 10.64898/2026.05.03.721751 medRxiv
Top 0.1%
60.9%

Large language models are increasingly deployed in clinical decision-support contexts, yet systematic evaluation of their factual reliability in generating patient-specific diagnostic reports remains sparse, particularly for laboratory interpretation tasks. This study presents a controlled transparency experiment in which four frontier LLMs -- Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.2, and Gemini 3.1 Pro -- each generated diagnostic reports for 36 patients (29 female, 7 male; aged 27-64) with biomarker profiles spanning metabolic, endocrine, and nutritional markers. A transparency engine extracted up to 50 claims per report (3,035 total), searched for supporting scientific evidence, and classified each claim as supported by science, plausible, or unsupported. Unsupported claims were uncommon: the transparency engine classified 2.7% of claims as unsupported (hereafter, the pipeline-measured hallucination rate; naive claim-level 95% Wilson CI: 2.2%-3.4%), with GPT-5.2 at the lowest observed rate (1.7%) and Claude Opus 4.6 at the highest (3.6%). However, mechanistic verification revealed a much larger plausibility gap: 915 claims (30.2%) were biologically reasonable but lacked a fully verified evidence chain, bringing the share of claims not fully supported by direct evidence to 32.9%. Gemini 3.1 Pro produced the highest plausible proportion (39.6%), suggesting a more conservative but less fully grounded reasoning profile. Although coarse support-level distributions were broadly similar across models (Cramér's V = 0.081), claim-level analysis revealed substantial narrative divergence: 61.2% of claims were unique to a single model, and matched-claim agreement was low (Cohen's kappa = 0.233), indicating that models generate substantively different clinical narratives for the same patient data despite comparable aggregate support profiles. 
These findings show that hallucination metrics alone understate the share of claims not fully verified under this protocol, and that claim-level mechanistic verification is needed to distinguish the proven from the merely plausible in metabolic and endocrine laboratory interpretation, with generalizability to other clinical domains requiring further study.
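The abstract above reports a naive claim-level 95% Wilson confidence interval for the share of unsupported claims. As a rough illustration of how such an interval is computed, here is a minimal Python sketch of the standard Wilson score formula. The function name `wilson_interval` is hypothetical, and the count of unsupported claims is an assumption derived by rounding 2.7% of 3,035, because the abstract gives only the percentage; the result may therefore differ slightly from the published bounds.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Assumed count: the abstract reports only "2.7% of 3,035 claims".
total_claims = 3035
unsupported = round(0.027 * total_claims)

low, high = wilson_interval(unsupported, total_claims)
print(f"{unsupported}/{total_claims}: {low:.1%} to {high:.1%}")
```

The interval is "naive" in the abstract's sense because it treats every claim as independent, whereas claims drawn from the same report or the same patient are likely to be correlated.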

4
Self-Reported Side Effects of Semaglutide and Tirzepatide in Online Communities

Sehgal, N. K. R.; Tronieri, J. S.; Ungar, L.; Guntuku, S. C.

2026-03-13 health informatics 10.64898/2026.03.12.26348253 medRxiv
Top 0.1%
58.4%

Social media can reveal patient experiences with glucagon-like peptide-1 receptor agonists (GLP-1 RAs) that extend beyond clinical trial data. We analyzed 410,198 Reddit posts (May 2019-June 2025) mentioning semaglutide or tirzepatide. A total of 67,008 users self-reported using these medications, and 43.5% described at least one side effect. Gastrointestinal symptoms predominated; reported effects included nausea (36.9%), fatigue (16.7%), vomiting (16.3%), constipation (15.3%), and diarrhea (12.6%). Notably, reproductive symptoms (e.g., menstrual irregularities) and temperature-related complaints (e.g., chills, hot flashes) emerged as unrecognized potential effects. These findings highlight patient concerns not well captured in current labeling or trials. Large-scale social media analysis can complement traditional pharmacovigilance by detecting emerging safety signals and expanding understanding of the real-world safety profile of GLP-1 RAs.

5
Cognitive AI-Assisted Primary Care Health Delivery: A Pilot Study in Bangladesh

Kabir, R. A.; Williams, M.; Rayhan, N.

2026-04-05 public and global health 10.64898/2026.04.03.26349253 medRxiv
Top 0.1%
58.2%

Research has documented persistent physician workforce shortages globally, with projected shortfalls threatening primary care access in underserved populations. Existing AI applications in healthcare have largely focused on predictive risk-scoring tools that generate probability estimates but do not reduce the time a physician spends completing a patient encounter. A January 2025 study further demonstrated that large language models lack the metacognitive capacity necessary for reliable medical reasoning, i.e., being able to ask appropriate questions in the absence of information to collect patient history and update differential diagnoses. This paper reports on a 2025 pilot deployment of ClinicalAssist in Bangladesh that tested a fundamentally different model: an AI system designed to replicate every step of the clinical workflow. Across 239 unique patients, 277 encounters, and 287 diagnostic opportunities, the system achieved an overall diagnostic accuracy of 94.7%, with chronic disease accuracy of 98.0% and acute care accuracy of 88.9%. These results suggest that cognitive AI has the potential to be a powerful clinical force multiplier if properly integrated in workflow.

6
From Concept to Clinic: Real World Evidence for Autonomous AI Deployment in Primary Care Telemedicine

Saenz, A. D.; Schumacher, E.; Naik, D.; Khosla, N.; Kannan, A.

2026-03-20 health informatics 10.64898/2026.03.18.26348749 medRxiv
Top 0.1%
58.1%

Systems powered by large language models are widely used for health information and advice, yet robust evidence for their safety and effectiveness in real-world clinical care remains lacking. Most existing studies evaluate general-purpose chatbots in artificial settings, failing to account for the critical role of system design, deployment context, and integrated safety mechanisms. Here, we report, to our knowledge, the first large-scale, clinician-blinded, real-world evaluation of a multi-agent LLM-based system deployed within a nationwide U.S. primary care telemedicine platform, assessing readiness for task-specific autonomous deployment. In 2,379 real patient encounters, where users actively sought medical care and completed full visits with licensed clinicians, we compared the AI system's intake diagnoses and disposition suggestions to those of treating clinicians, who were blinded to the AI's outputs. The AI's top-1 diagnosis matched the clinician's diagnosis in 91.3% of cases overall, increasing to 96.3% among cases meeting a pre-specified safety confidence threshold, and 97.9% in common, lower-complexity conditions that met the same confidence threshold. Disposition accuracy was similarly high, with an overall error rate of 2.5% and no errors in suggestions to emergency room or home management. These results demonstrate that purposeful system architecture, rather than model capability alone, is essential for safe and effective autonomous clinical AI. We propose a staged, task-calibrated deployment framework, in which AI can be introduced autonomously for well-defined tasks with explicit safety gating and continuous monitoring, expanding scope as real-world evidence accrues. Our findings provide the first real-world evidence of readiness for safe autonomous clinical AI and offer a practical roadmap for its responsible deployment at scale.

7
Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases

Auger, S. D.; Varley, J.; Hargovan, M.; Scott, G.

2026-04-23 neurology 10.64898/2026.04.22.26351488 medRxiv
Top 0.1%
56.3%

Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple Sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management. Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT-5.2/5-mini) were instructed to analyse cases to provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures. Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5-100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs successfully included MS in the differential diagnoses for >91% of cases. However, diagnostic competence did not associate with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2% [95% CI 5.6-8.8]; Pro: 15.8% [13.6-18.1]) compared to GPT-5-mini (23.5% [20.8-26.1]), frequently overlooking contraindications like active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (9.6% GPT-5.2; 6.4% GPT-5-mini) compared to <1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail. Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as >14 days old. 
Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk. 1-2 sentence description: By scaling an expert-validated simulation process to 10,000 cases, this study demonstrates that high diagnostic accuracy by AI can mask rare but dangerous safety failures. This large-scale approach provides a framework for uncovering clinical "blind spots" that small-scale evaluations miss, helping inform the development of safety guardrails before AI is deployed in practice.

8
Free-text MAUDE narratives provide a source-robust representation layer for biomaterial-device surveillance

Chen, H.

2026-05-05 health informatics 10.64898/2026.05.03.26352339 medRxiv
Top 0.1%
53.5%

Implantable biomaterial devices require effective post-market surveillance because clinically important failure patterns often emerge only after widespread use. However, surveillance workflows often rely on structured coded summaries that compress heterogeneous adverse-event narratives into coarse categories. This study compares coded and free-text narrative representations across 1,500 FDA MAUDE reports from three biomaterial device classes (coronary stents, bone cement, and surgical mesh) to test whether narratives preserve a more source-robust surveillance representation. Under manufacturer-held-out evaluation, narrative TF-IDF features outperformed structured code-only features (macro F1 0.925 versus 0.827), while delexicalized narratives retained strong grouped performance after masking device-class, manufacturer, brand, and legal-template tokens (F1 0.897). Narrative topics resolved reported events into procedural, anatomical, host-response, and reporting-context patterns, and an interpretable classifier recovered code-derived complication phenotypes from narrative text alone (mean F1 0.902, AUC 0.967). These findings support free-text adverse-event narratives as a complementary representation layer for post-market device surveillance, while remaining bounded by passive adverse-event reporting limitations and requiring validation across additional years, device classes, and independently adjudicated outcomes. Author Summary: When an implanted medical device fails inside a patient, the event is reported to the FDA's MAUDE database. Each report includes both a standardized code and a written narrative describing what happened. We asked whether these two representations carry the same information. Using 1,500 reports covering coronary stents, bone cement, and surgical mesh, we found that coded fields lose much of the clinical detail present in narratives. 
Importantly, narrative-based classifiers remained accurate even when tested on reports from manufacturers not seen during training, while code-based classifiers dropped substantially. This matters because real-world surveillance must generalize across different reporting sources. We also found that narrative text can recover clinically meaningful complication patterns that are defined by codes, and that most reports never name the specific biomaterial involved. These findings suggest that narrative text deserves a more central role in post-market device monitoring, complementing the coded fields that current surveillance pipelines rely on.

9
Exploring the Interpretability of AI Decision Support Systems for Surgical Anatomy Recognition

Khan, D. Z.; Adams, T.; Wijekoon, A.; Ramirez Herrera, R.; Bano, S.; McCulloch, P.; Stoyanov, D.; Clarkson, M. J.; Costanza, E.; Blandford, A.; Marcus, H.; CARES Evaluation Group

2026-06-03 surgery 10.64898/2026.06.02.26354729 medRxiv
Top 0.1%
53.3%

Artificial intelligence (AI) decision support systems for surgery hold promise but face barriers to adoption, particularly around the interpretability of their outputs. We conducted an international cross-sectional survey of 47 neurosurgeons to evaluate perspectives on literature-derived explanation techniques for AI-generated anatomical segmentations, using endoscopic pituitary surgery as a high-risk exemplar. Participants ranked certainty scores, certainty maps, saliency maps, scene similarity scores, and nearest-neighbour illustrations, and rated them using a modified Explanation Satisfaction Scale alongside free-text feedback. Certainty-based techniques were consistently ranked and rated highest for interpretability - valued for aligning with surgical decision-making by conveying confidence (via scores) and anatomical boundaries (via maps). Saliency- and similarity-based methods were judged less clinically relevant and better suited to educational settings. Certainty-based explanations, therefore, appear most acceptable to surgeons for clinical integration of decision support systems, though their impact on AI acceptability, trust calibration, and performance requires prospective evaluation across surgical domains.

10
Evaluating Sycophancy in Frontier Models Using Persona-Driven Challenge

Hazare, N. S.; Goel, N.; Yu, C.; Agaron, S.; Sharma, A.; Parchure, P.; Patel, D.; Timsina, P.; Kaplan, B.; Lampert, J.; Vakil, A.; Kovatch, P.; Darrow, B.; Glicksberg, B. S.; Charney, A.; Nadkarni, G. N.; Sakhuja, A.

2026-05-20 health informatics 10.64898/2026.05.17.26353406 medRxiv
Top 0.1%
53.0%

Large language models (LLMs) are increasingly used for lay health queries, yet may abandon correct recommendations under pressure, a vulnerability termed sycophancy. We evaluated sycophancy across five frontier LLMs (Claude Opus 4.6, Claude Sonnet 4.6, GPT 5.4, Grok 4.1, Gemini 3 Flash) using 200 synthetic clinical vignettes, each anchored to a unanimous correct treatment baseline and challenged by nine personas representing both vulnerable and authority roles. Overall, 7.1% of responses were sycophantic, varying tenfold across personas (1.7 to 19.3%) and sixfold across LLMs (2.4 to 15.3%). Vulnerable personas elicited more sycophantic responses, with the medical student persona eliciting the highest rate (19.3%). In adjusted generalized estimating equation (GEE) models, persona and LLM were both independent predictors of sycophantic responses, and vulnerable personas remained independent predictors, a reversal of the expected authority gradient. Persona-driven sycophancy evaluation should be integrated into pre-deployment safety assessment of clinical LLMs.

11
Digital journaling enables privacy-preserving behavioral phenotyping and real-time risk monitoring at scale

Milham, M.; Low, D.; Erkent, A.; Trabulsi, J.; Kass, M. C.; Vos de Wael, R.; Yenepalli, S.; Wang, Y.; Leyden, M.; Jordan, C.; Salum, G.; Alexander, L.; Schubiner, G.; Hendrix, L.; Koyama, M.; Mears, L.; McAdams, R.; White, C.; Merikangas, K.; Satterthwaite, T. D.; Franco, A.; Klein, A.; Koplewicz, H.; Leventhal, B.; Freund, M.; Kiar, G.

2026-04-08 psychiatry and clinical psychology 10.64898/2026.04.04.26349881 medRxiv
Top 0.1%
52.3%

Digital mental health applications enable high-frequency behavioral monitoring and scalable interventions. Journaling provides a therapeutically grounded and intrinsically engaging activity for many users. AI-based text analysis enables privacy-preserving phenotyping of clinically relevant patterns in naturalistic writing, including emotional distress and behavioral risk (e.g., indicators of intent, planning, or preparatory actions for harm to self or others). We evaluated a mobile journaling platform in an 8-week randomized controlled trial (N = 507) of young adults with mild-to-moderate anxiety and depression symptoms. Journaling produced modest reductions in anxiety relative to controls at the 8-week endpoint and 1-month follow-up (d = 0.16-0.19). Effects were small and did not remain significant after correction for multiple comparisons; complementary Bayesian models nonetheless provided moderate-to-strong directional evidence (90-97%) supporting a modest anxiety reduction. In parallel, behavioral phenotyping analyses showed that high-risk journal entries were more common among younger users (OR = 0.77 per year of age, p = 0.007). Text-based risk signals and self-reported energy exhibited significant circadian variation (e.g., risk probability was highest during late-night and overnight hours). Within-person analyses demonstrated strong short-term persistence in mood and risk states, with calm/relaxed showing the highest persistence and anxious/agitated exhibiting the lowest persistence. High-risk journal entries clustered temporally and were preceded by sustained low valence and energy. Although affective volatility was associated with acute declines within the same affective dimension (pleasantness or energy), it was not associated with escalation to high-risk states. Key behavioral dynamics observed in the trial were replicated in an independent general population dataset (N = 16,630). 
Collectively, these findings demonstrate that privacy-preserving digital journaling can support scalable longitudinal behavioral phenotyping and real-time risk monitoring while providing modest clinical benefit for anxiety symptoms.

12
Explainable AI for Data-Driven Design of High-Dimensional Predictive Studies

Yan, J.; Machlanski, D.; Butler, K.; Dimitrakopoulos, P.; Harrison, E. M.; Guthrie, B. M.; Tsaftaris, S. A.

2026-05-24 health informatics 10.64898/2026.05.21.26353781 medRxiv
Top 0.1%
52.2%

Predictive modelling is important for health data analysis and data-driven clinical decision-making. However, predictive studies are challenging to design optimally by hand when tens or even hundreds of features require selection, transformation, or interaction modelling. While complex machine learning models offer high performance, their "black-box" nature limits the clinical trust, transparency, and interpretability required for decision-making. We developed and evaluated an Exploratory AI Recommender that provides data-driven recommendations to improve predictive performance of existing interpretable statistical models. The developed framework uses flexible AI modelling to capture complex data patterns and explainable AI techniques to translate the patterns into three recommendation types: feature exclusion, non-linear terms, and feature interactions. We evaluated the framework by comparing predictive performance of a baseline (i.e., no interactions or non-linear terms) Cox Proportional Hazards (CPH) model against an augmented CPH incorporating recommendations suggested by our method. The primary analysis predicts the time to the first occurrence of a fall or related injury in 245,614 patients. Our method recommended excluding 23 features, adding non-linear terms for two features, and adding 221 feature interactions. The C-index improved from 0.805 (95% CI 0.798-0.812) to 0.815 (95% CI 0.809-0.822), and so did calibration (intercept: -0.006 to 0.003; slope: 1.063 to 0.950). All recommendations were supported by existing literature. The method also proved effective on two additional public datasets, demonstrating wider applicability. The proposed Exploratory AI Recommender demonstrates the potential of explainable AI and data-driven study design to improve the process of developing, and the performance of, high-dimensional transparent predictive models.

13
The Multimodal Anonymizer: a fully local multi-agent AI system for medical data deidentification

Hirsch, A.; Ten, F. W.; Krueger, K. S.; Geyer, R.; Roeschl, T.; Groeschel, M.; Rostin, P.; Eils, R.; Spott, M.; Prasser, F.; Meyer, A.; Madrid, J.

2026-06-05 health informatics 10.64898/2026.05.28.26353952 medRxiv
Top 0.1%
51.4%

Background: Safe reuse of multimodal hospital data for AI development is limited by the absence of reliable, context-aware deidentification across multimodal data and longitudinal patient data. Existing approaches are largely modality-specific and can indiscriminately remove clinically important information. Methods: We developed the Multimodal Anonymizer, a modular, locally deployable multi-agent framework integrating multimodal large language models, task-specific neural networks and rule-based transformations. We evaluated 16 orchestrator model configurations on a benchmark built from publicly available data and hospital data from our institution. The benchmark dataset included data from different origins: 250 MIMIC-IV patients with synthetically injected personally identifiable information (PII) supplemented with head CT, face images, handwriting, audio, German clinical-text datasets and local data. Primary outcomes were deidentification sensitivity and preservation of clinically important content; secondary analyses examined model characteristics, reproducibility, and performance against leading market and open-source solutions. Results: The best local configuration (the orchestrator being Qwen3-VL-235B-A22B-Thinking) achieved near-complete deidentification across all datasets, with per-patient sensitivity of 98.80% (95%-CI 97.20; 100), and per-PII sensitivity of 99.82% (95%-CI 99.76; 99.88). Critical clinical preservation was 99.60% (95%-CI 98.80; 100) per-patient, and clinical preservation was 99.61% (95%-CI 99.51; 99.71) per-file. All modalities achieved at least 98.30% sensitivity (lower bound 95%-CI). On our local data, the system achieved a deidentification sensitivity of 100% per-patient and per-PII; and a critical clinical preservation of 100% per-patient as well as a clinical preservation of 99.97% (95%-CI 99.91; 100) per-file. 
When comparing orchestrators, the leading local models were similar to proprietary models (GPT-5.2) in deidentification sensitivity while showing higher deidentification specificity. The Multimodal Anonymizer outperformed previous tools on most modalities. Conclusion: Near-complete, utility-preserving deidentification of multimodal clinical data is achievable with a unified, locally deployable multi-agent system, enabling safer large-scale reuse of hospital data for research and AI development.

14
Artificial Intelligence for Automated, Highly Accurate, and Scalable Multimodal EHR Data Abstraction

Margaritis, G.; Petridis, P.; Bertsimas, D.; Bloom, J.; Hagberg, R.; Habib, R.; Shahian, D. M.; Orfanoudaki, A.

2026-03-17 health informatics 10.64898/2026.03.16.26348522 medRxiv
Top 0.1%
49.7%

Electronic health records (EHRs) contain rich multimodal data but remain underutilized for populating clinical registries due to the time and cost of manual abstraction. We developed an AI-driven pipeline to automate data abstraction for variables in the Society of Thoracic Surgeons Adult Cardiac Surgery Database (ACSD). Models were developed using Mass General Brigham data and externally validated on Hartford HealthCare data. The pipeline processes ten clinical EHR sources (seven unstructured text types and three structured data types), each encoded using two language-model embeddings and term frequency-inverse document frequency (TF-IDF). This approach yielded 30 source-specific models per target variable whose predictions were aggregated by an ensemble meta-learner, followed by a dual-threshold confidence framework that enforced registry-grade accuracy standards and deferred uncertain predictions to human review. The developed pipeline achieved an overall accuracy exceeding 99% across 647 registry variables, while automatically completing 49.5% and 43.2% of variables at the two sites, respectively. These results demonstrate that AI-assisted abstraction can substantially reduce clinical registry data collection burden while maintaining high accuracy.
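The count of 30 source-specific models follows from pairing each of the ten EHR sources with each of the three encodings (two language-model embeddings plus TF-IDF). The Python sketch below illustrates that arithmetic together with one plausible reading of a dual-threshold deferral rule for a binary variable. Everything named here is hypothetical: the source and encoding labels are placeholders, and the `route` function and its threshold values are assumptions, not the authors' implementation, which the abstract does not describe in detail.

```python
from itertools import product

# Placeholder labels: the abstract gives only the counts (7 + 3 sources, 3 encodings).
TEXT_SOURCES = [f"text_{i}" for i in range(1, 8)]
STRUCTURED_SOURCES = [f"structured_{i}" for i in range(1, 4)]
ENCODINGS = ["lm_embedding_a", "lm_embedding_b", "tfidf"]


def base_model_keys() -> list[tuple[str, str]]:
    """One (source, encoding) pair per source-specific model."""
    return list(product(TEXT_SOURCES + STRUCTURED_SOURCES, ENCODINGS))


def route(prob_positive: float, low: float = 0.02, high: float = 0.98) -> str:
    """Assumed dual-threshold rule: auto-complete only confident predictions.

    The thresholds are illustrative; in practice they would be tuned on
    validation data until auto-completed predictions meet the target accuracy.
    """
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low < high <= 1")
    if prob_positive >= high:
        return "auto_positive"
    if prob_positive <= low:
        return "auto_negative"
    return "human_review"


print(len(base_model_keys()))  # 10 sources x 3 encodings
for p in (0.995, 0.40, 0.01):
    print(p, route(p))
```

Widening the gap between the two thresholds raises the accuracy of auto-completed values at the cost of sending more predictions to human review, which is the trade-off behind the reported completion rates.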

15
Disease Risk Prediction Using Structured EHR Data: Can Generalist Large Language Models Match Specialized Clinical Foundation Models? A Comparative Evaluation with Fine-Tuning

Mao, B.; Prasadha, M. K.; Xie, Z.; He, J.; Ghebranious, M.; Xu, H.; Zhi, D.; Rasmy, L.

2026-05-01 health informatics 10.64898/2026.04.24.26351503 medRxiv
Top 0.1%
49.3%

Background: Electronic health records (EHRs) with clinical decision support tools are now ubiquitous in healthcare organizations. Clinical foundation models (CFMs) pretrained on large-scale, heterogeneous structured EHR data have emerged as a powerful approach to improve predictive performance and generalizability. Meanwhile, large language models (LLMs) pretrained on broad data sources are being applied to an expanding range of healthcare tasks. However, it remains unclear whether generalist LLMs can match specialized CFMs for disease risk prediction using structured clinical data. Methods: We compared CFMs (Med-BERT, CLMBR) against fine-tuned generalist LLMs (Mistral, LLaMA-2/3/3.1), a clinical LLM (Me-LLaMA), and LLM-generated embeddings paired with simple classifiers (using DeepSeek, Qwen3, and GPT-OSS) on two disease risk prediction tasks: heart failure risk among diabetic patients (DHF) and pancreatic cancer diagnosis (PaCa). Evaluations spanned multi-site EHR data, claims data, and an open-source single-institution benchmark (EHRSHOT). Performance was assessed using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). Results: On larger EHR and claims cohorts (>30,000 patients), fine-tuned CFMs outperformed fine-tuned LLMs by a small but statistically significant margin (<1% AUROC). The clinical LLM performed comparably to generalist LLMs despite being smaller. On the open-source PaCa cohort (3,810 patients, 199 cases), LLMs achieved slightly higher AUROCs that were not statistically significant (LLaMA-3.1-70B 86.1% vs. Med-BERT 85.3%, p=0.27), but CFMs achieved significantly higher AUPRC (Med-BERT 55.9% vs. LLaMA-3.1-70B 41.1%, p=0.001). Notably, LLM-generated trajectory embeddings paired with logistic regression or a simple MLP, without any LLM fine-tuning, achieved the best overall performance, with AUROC exceeding 90% (Qwen3) and AUPRC reaching 66% (GPT-OSS 20B). 
Conclusion: LLM-generated embeddings with lightweight classifiers outperformed both fine-tuned CFMs and fine-tuned LLMs on AUROC and AUPRC. While these results demonstrate the potential of generalist models to match or surpass specialized CFMs, their substantially greater computational cost and variable AUPRC performance in the fine-tuning setting warrant caution. We provide a reproducible evaluation framework and codebase to support continued benchmarking.

16
An Explainable Multimodal AI Framework with Reinforcement Learning for Post-Surgical Clinical Decision Support

Ahmed, M.; Ahmed, F.; Mow, S. M.; Taha, P. A.; Barua, S.; Rahman, M. M.; Rafy, A.; Mondol, S. M.; Faisal, M. I.

2026-06-10 health informatics 10.64898/2026.06.08.26355217 medRxiv
Top 0.1%
48.1%

Post-surgical adverse outcomes, including mortality, intensive care readmission, and complications, remain major challenges for clinical decision-making. Existing machine learning approaches focus on outcome prediction while operating as opaque systems, limiting clinical trust and the translation of predictions into treatment decisions, and many clinical studies rely on synthetic data in which shared intermediate variables create circular dependencies between inputs and targets that compromise reported performance. We aimed to develop an explainable multimodal architecture and a rigorous evaluation methodology that address these gaps. We designed a two-stage architecture integrating supervised deep learning for risk prediction with conservative Q-learning for action recommendation. The first stage uses five modality-specific encoders for structured records, physiological time-series, chest radiographs, clinical notes, and surgical metadata, unified through cross-modal attention into a shared patient-state representation. The second stage applies offline reinforcement learning to recommend clinical actions while preventing value overestimation. We formally characterized a target-leakage flaw in synthetic pipelines and proposed a real-data methodology using a verified clinical database, with event-censored temporal separation and uncertainty-weighted per-task training. Component-level behavior was validated on a controlled synthetic benchmark, demonstrating that the architecture functions as designed without claiming clinical validity. The cross-modal attention and risk-prediction components behaved as expected, whereas the offline reinforcement learning stage did not converge on the benchmark, indicating that value estimation requires further investigation on real clinical data.
The architecture provides dual-level explainability through attention visualization and value decomposition, contributing a deployable design, a formal methodological critique of synthetic-data practices, and a complete framework for clinically valid evaluation.

17
Preliminary Reliability and Validity of SynapTrack, a Smartphone-Based Digital Biomarker Platform for Remote Assessment of Cervical Spondylotic Myelopathy

Yakdan, S.; Singh, P.; Arkam, F.; Chen, E.; Lewis, A.; Steel, B.; Becker, I.; Guo, W.; Naveed, H.; Wang, C.; Yang, D.; Wang, Z.; Ray, W. Z.; Hassenstab, J.; Steinmetz, M. P.; Ghogawala, Z.; Kelleher, C.; Greenberg, J.

2026-06-01 surgery 10.64898/2026.05.29.26354454 medRxiv
Top 0.1%
44.9%

Background and Objectives: Cervical spondylotic myelopathy (CSM) is a leading cause of neurological disability in older adults. However, validated, scalable tools to quantify disease severity and changes over time are lacking. Recent advances in smartphone technology have opened new avenues for longitudinal, objective, and remote monitoring of neurological conditions. We performed a preliminary evaluation of the reliability and validity of SynapTrack, a smartphone-based digital platform for objective remote CSM assessments. Methods: In this single-center prospective cohort study, 265 participants (151 with CSM, 114 healthy controls) completed in-person SynapTrack assessments related to tapping, pinching, and vibratory detection, along with reference laboratory measures of dexterity (Box and Block Test, 9-Hole Peg Test) and vibratory sensation (tuning fork). A subset completed repeated home-based testing to assess test-retest reliability. We evaluated convergent validity, construct validity against the modified Japanese Orthopedic Association (mJOA) score, known-groups validity, and test-retest reliability (intraclass correlation coefficient, ICC). Results: Smartphone-derived metrics demonstrated good-to-excellent test-retest reliability, with the strongest stability for vibratory detection threshold (ICC = 0.92), overall and non-dominant tapping speed (ICC = 0.90 each), and pinching successful targets (ICC = 0.90). Convergent validity was supported by moderate-to-strong correlations between digital metrics and reference laboratory dexterity tests (ρ up to 0.60 for tapping speed; up to -0.65 for the vibratory threshold). Construct validity against the mJOA was strongest for the vibratory threshold (ρ = -0.53 to -0.54) and Level 2 non-dominant pinching errors (ρ = -0.45).
Selected metrics distinguished CSM patients from controls with good discrimination, including non-dominant tapping speed (AUROC = 0.76, 95% CI 0.68-0.85), Level 2 dominant pinching successful targets (AUROC = 0.78, 95% CI 0.62-0.94), and the non-dominant vibratory threshold (AUROC = 0.77, 95% CI 0.64-0.90). Conclusions and Relevance: A smartphone-based battery of upper-extremity sensorimotor tasks demonstrated preliminary reliability and validity in CSM. Furthermore, to our knowledge, the novel vibratory detection task represents the first smartphone-based sensory assessment used for CSM. Collectively, these findings position SynapTrack as a scalable platform for objective, remote neurological monitoring of CSM.

18
Multinational Validation of the Intensive Documentation Index for ICU Mortality Prediction: Temporal Resolution and ICU Mortality

Collier, A.; Shalhout, S. Z.

2026-03-23 health informatics 10.64898/2026.03.19.26348852 medRxiv
Top 0.1%
43.1%

Clinical documentation timestamps generate a continuous, zero-burden behavioral signal in the electronic health record. We developed the Intensive Documentation Index (IDI) and validated it in two independent cohorts: MIMIC-IV (26,153 U.S. ICU heart failure patients, primary outcome in-hospital mortality) and HiRID (33,897 Swiss all-ICU patients, primary outcome ICU mortality). In MIMIC-IV, the IDI-enhanced logistic regression achieved an AUROC of 0.6491, compared with a baseline of 0.6242 (Brier score of 0.1299). In HiRID, where documentation latency is 1.2 minutes, compared with 15 hours in MIMIC-IV, AUROC was 0.9063, well above published APACHE IV and SAPS III benchmarks. The approximately 0.27 AUROC gap reflects the importance of temporal granularity in documentation-based risk stratification. IDI requires no physiologic measurements, making it complementary to established severity scores. Prospective validation in real-time EHR systems is required before clinical deployment.

19
A clinic-updated digital twin for Parkinson's disease progression: governed Bayesian forecasting with uncertainty-gated reporting

Hemedan, A. A.

2026-03-22 health informatics 10.64898/2026.03.19.26348807 medRxiv
Top 0.1%
42.8%

Background: Clinical digital twins hold considerable promise for forecasting disease progression, yet the question of when a model's outputs should be withheld remains largely unaddressed. A predictive model qualifies as a governed reporting system only when it specifies the operational boundaries under which its outputs are reliable and enforces criteria for suppressing results that fall outside those bounds. Methods: We present a governed Bayesian digital twin for multi-domain Parkinson's disease (PD) progression, tracking motor function (MDS-UPDRS Part III), cognition (Montreal Cognitive Assessment, MoCA), and autonomic function (SCOPA-AUT). A monotone latent state-space model captures disease progression under four architectural constraints: non-decreasing latent severity, visit-triggered updating, full posterior uncertainty propagation, and non-causal scope. A six-rule confidence gate evaluates each forecast before release; when evidence is insufficient, the gate suppresses the output and returns a structured reason code. We evaluated the framework on the Parkinson's Progression Markers Initiative (PPMI), a multicentre longitudinal observational study (N=4,628 participants; 28,185 visits), using five-fold cross-validation with independent model refits, equity analysis, and coupling-topology sensitivity assessment. The framework is available at https://gitlab.com/ahmed.hemedan/symphony-dt, with a research prototype at https://symphony-dt.com/. Results: Predictive interval coverage at the 95% level ranged from 94% to 96% across all three endpoints, compared with 64-69% for linear mixed-effects baselines. The confidence gate released governed forecasts at 32.7% of visits under strict three-domain requirements, increasing to 48.1% under a validated partial-observation extension. Suppression was predominantly driven by incomplete clinical assessment (51.5%) rather than model uncertainty (0.2%), and operated equitably across sexes (Cramér's V=0.049).
Five of six cross-domain coupling parameters were identified from the data (sign probability ≥ 0.99; contraction ratios 0.19-0.35), with all cross-domain forecast correlations matching the directions predicted by the coupling topology. The framework's own diagnostics localised two observation-model limitations, Prodromal motor heteroscedasticity and medication-burden sensitivity, to a single model layer and specified their resolution. Conclusions: Governed silence, defined as the rule-based suppression of predictions when reliability conditions are not met, can be embedded in clinical prediction architecture, quantified as a pipeline output, and audited for equity. This work demonstrates the technical executability of governed digital twin architecture at cohort scale and provides a foundation for prospective deployment under routine clinical conditions.
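
The abstract above describes a confidence gate that either releases a forecast or suppresses it with a structured reason code, but it does not list the six rules. The sketch below is therefore only an illustration of the general pattern, not the authors' gate: the two rules, the reason-code strings, the `max_width` threshold, and the names `GateResult` and `confidence_gate` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# The three domains named in the abstract; the keys themselves are hypothetical.
REQUIRED_DOMAINS = ("motor", "cognition", "autonomic")


@dataclass
class GateResult:
    released: bool
    reason_code: Optional[str] = None  # None when the forecast is released


def confidence_gate(
    observed: Dict[str, Optional[float]],
    intervals: Dict[str, Tuple[float, float]],
    max_width: float,
) -> GateResult:
    """Release a forecast only if every illustrative rule passes."""
    # Rule A (hypothetical): each required domain was assessed at this visit.
    for domain in REQUIRED_DOMAINS:
        if observed.get(domain) is None:
            return GateResult(False, f"INCOMPLETE_ASSESSMENT:{domain}")
    # Rule B (hypothetical): the predictive interval is not wider than max_width.
    for domain in REQUIRED_DOMAINS:
        low, high = intervals[domain]
        if high - low > max_width:
            return GateResult(False, f"UNCERTAINTY_TOO_HIGH:{domain}")
    return GateResult(True)


if __name__ == "__main__":
    visit = {"motor": 21.0, "cognition": None, "autonomic": 9.0}
    bands = {"motor": (18.0, 25.0), "cognition": (24.0, 28.0), "autonomic": (6.0, 12.0)}
    print(confidence_gate(visit, bands, max_width=10.0))
```

Returning a reason code instead of a bare refusal is what makes suppression countable per cause, which is how a breakdown such as "incomplete assessment versus model uncertainty" could be tallied.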

20
Multimodal prediction of visual improvement in diabetic macular edema using real-world electronic health records and optical coherence tomography images

Sun, S.; Cai, C. X.; Fan, R.; You, S.; Tran, D.; Rao, P. K.; Suchard, M. A.; Wang, Y.; Lee, C. S.; Lee, A. Y.; Zhang, L.

2026-04-24 health informatics 10.64898/2026.04.23.26351616 medRxiv
Top 0.1%
42.8%

Multimodal learning has the potential to improve clinical prediction by integrating complementary data sources, but the incremental value of imaging beyond structured electronic health record (EHR) data remains unclear in real-world settings. We developed a multimodal survival modeling framework integrating optical coherence tomography (OCT) and EHR data to predict time to visual improvement in patients with diabetic macular edema (DME), and evaluated how different ophthalmic foundation model representations contribute to prognostic performance. In a retrospective cohort of 973 patients (1,450 eyes) receiving anti-vascular endothelial growth factor therapy, we compared multimodal models combining 22,227 EHR variables with 196,402 OCT images, with OCT embeddings derived from three ophthalmic foundation models (RETFound, EyeCLIP, and VisionFM). The EHR-only model showed minimal prognostic discrimination (C-index 0.50 [95% CI, 0.45-0.55]). Incorporating OCT improved performance, with the magnitude of improvement depending on the representation. EHR+RETFound achieved the strongest performance (C-index 0.59 [0.54-0.65]), followed by EHR+EyeCLIP (0.57 [0.52-0.62]) and EHR+VisionFM (0.56 [0.51-0.61]). Multimodal models, particularly EHR+RETFound, demonstrated improved risk stratification with clearer separation of Kaplan-Meier curves. Partial information decomposition revealed that prognostic information was dominated by modality-specific contributions, with OCT and EHR providing largely distinct signals and minimal shared information. The magnitude of OCT-specific contribution varied across foundation models and aligned with observed performance differences. These findings indicate that OCT provides complementary prognostic value beyond structured clinical data, but gains are modest and depend strongly on representation choice. 
Our results highlight both the promise of multimodal modeling for personalized prognosis and the need for rigorous, context-specific evaluation of foundation models in real-world clinical settings.
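
The abstract above compares models by C-index, where 0.50 corresponds to chance-level ranking of event times. As a rough illustration, and not the authors' pipeline, the following sketch computes Harrell's concordance index in pure Python; `concordance_index` is a hypothetical helper name, and a higher risk score is assumed to mean an earlier event (here, earlier visual improvement).

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[int],
    risk: Sequence[float],
) -> float:
    """Harrell's C: share of comparable pairs ranked in the right order.

    A pair (i, j) is comparable when subject i had the event and did so
    strictly before subject j's observed time. Tied risk scores count 0.5.
    Pairs with tied times are skipped in this simplified version.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    t = [2, 4, 6, 8]          # follow-up time
    e = [1, 1, 0, 1]          # 1 = event observed, 0 = censored
    r = [0.9, 0.7, 0.8, 0.1]  # model risk scores
    print(concordance_index(t, e, r))             # 0.8
    print(concordance_index(t, e, [0.5] * 4))     # 0.5
```

The second call shows why an uninformative model lands at 0.5: with identical scores every comparable pair is a tie.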